Agasthya Shenoy and Dustin Tingley

Introduction

Harvard University has over 10,000 websites associated with unique faculty members. Any number of these sites may contain embedded or hyperlinked video. To find these videos, we created a Python script to scrape the html code from each website and its subpages. We then parsed the code, searching for html tags indicative of the presence of videos. Our results and further explanation of our methods are summarised below.


Methods

Defining ‘video content’

We defined video content as any video embedded within an <iframe> tag (used to embed videos in a webpage) or explicitly hyperlinked with an <a> tag. We did not include links to content creator user pages (e.g. ‘https://www.youtube.com/harvard’), only links that directly went to a video or video playlist. Using this definition, we found that videos mostly came from YouTube.com and Vimeo.com, while a few embedded videos used the Kaltura.com platform (see Results for a summary).

Acquiring faculty websites

Websites had been previously collected for the VPAL-R LINK project. Help in collection was provided by other entities at the University for the purpose of LINK.

Faculty websites were collected through 3 channels:

  1. an initial list of sites provided by Faculty Development & Diversity

  2. webpages manually collected by VPAL-Research

  3. Harvard Open Scholar (Harvard web publishing platform)

We defined ‘faculty’ as anyone with the position ‘Lecturer’ and above - meaning that people like graduate students and teaching fellows, who may still have their own websites, were excluded from our analysis. We have not actively collected the websites for these people.

Scraping and Cleaning the data

Scraping

Once the websites were collected, we used a Python script, called a “crawler”, to download the raw html from each website, including website subpages (e.g. ‘https://scholar.harvard.edu/dtingley/’ AND ‘https://scholar.harvard.edu/dtingley/publications’, etc.). This data was stored in an Amazon Web Services S3 bucket, which we use for the LINK project. Each page is associated with a HUID and a checksum - a code computed from the raw content of the page:

Example Data Structure (Website Collection)
person_id url checksum
40809295 https://scholar.harvard.edu/dtingley 1c20b7ba5e98adcd6e49ce345ae35042
40809295 https://scholar.harvard.edu/dtingley/contact_owner 91ef7d4b8de0e5bfcbe1270a3a5c61f2
40809295 https://scholar.harvard.edu/dtingley/publications/fast-cheap-and-imperfect-us-public-opinion-about-solar-geoengineering-0 bf020c95d094582f830ac6fc2382b9b3
40809295 https://scholar.harvard.edu/dtingley/publications/sparse-multilevel-regression-and-poststratification-smrp fce6f1ad2dfd465f548a4ba168482870
40809295 https://scholar.harvard.edu/dtingley/publications/effects-adaptive-learning-massive-open-online-course-learners%E2%80%99-skill 9d3d24ccc186bbcbdd276a727f2e3ec0
40809295 https://scholar.harvard.edu/dtingley/publications/effects-environmental-stressors-daily-governance 469b85a76149d1b25bb2b1940ce50ba4

We used the checksum to identify duplicates: because the checksum is created based on the raw content of a webpage, pages with identical checksums also have identical content - even if the urls are different. After identifying duplicate pages, we deleted them to avoid double counting videos. After the de-duplication process, we were left with 157,159 unique webpages.

Next, we used a Python package called Beautiful Soup to parse the raw html code. We first looked for embedded videos by searching for all instances of the <iframe> html tag that also contained the term “youtube”, “vimeo”, or “player”. Sometimes, <iframe> tags with “player” were actually embedded SoundCloud players, so “soundcloud” was an excluded term.

Once an embedded video was found, we incremented an “embedded_video” counter, and extracted an html snippet that contained the video, e.g.:

<iframe allowfullscreen="allowfullscreen" frameborder="1" height="205" src="https://www.youtube.com/embed/mh65upNaAqM" width="400"></iframe>

We then used the same process to find hyperlinked videos, this time looking for <a> tags that contained the term “youtube” or “vimeo”. A separate “hyperlinked_video” counter was incremented each time we found a video on a page, and we extracted the html snippet associated with it as well:

<a href="https://www.youtube.com/watch?v=mh65upNaAqM" target="_blank">online</a>

After this stage was complete, we added the embedded and hyperlinked counters to get the number of total videos, and stored the video counts and the html snippets in a .csv file:

Example Data Structure (Video Counts)
person_id url total_videos embedded_html hyperlinked_html
40809295 https://scholar.harvard.edu/dtingley/publications/what-makes-foreign-policy-teams-tick-explaining-variation-group-performance 0
40809295 https://scholar.harvard.edu/dtingley/publications/export/xml/374206 0
40809295 https://scholar.harvard.edu/dtingley/software/txtorg 0
40809295 https://scholar.harvard.edu/dtingley/publications/plain-text-transparency-acquisition-analysis-and-access-stages-computer 0
40809295 https://scholar.harvard.edu/dtingley/publications/computer-assisted-text-analysis-comparative-politics 0
40809295 https://scholar.harvard.edu/dtingley/publications/public-opposition-foreign-acquisitions-domestic-companiesevidence-united 0
Note:
In this case, Dustin did not have any videos on his website. If videos were present, the 'embedded_html' and 'hyperlinked_html' fields would be populated.

Cleaning

Finally, we merged the above data source with an existing data source of biographical information. This data source was also created for the LINK project, and contained information about faculty members’ school associations. During this merge, we lost 700 unique HUIDs from the data. Upon further investigation, it became clear that these 700 HUIDs were associated with non-faculty members (e.g. graduate students, former students, etc.). These people were present in our original data because anyone with access to HarvardKey can create an OpenScholar website - even if they are not faculty members.

When doing a final pass at the extracted code, we noticed that many faculty websites include a footer designed by their respective schools. These footers often included a link to the school’s YouTube user page. These schools can be found below:

Institutional YouTube Channels
School YouTube Link in Footer
Harvard Graduate School of Education youtube.com/user/HarvardEducation
Harvard Law School youtube.com/user/HarvardLawSchool
Harvard Divinity School youtube.com/user/HarvardDivinity
Harvard Kennedy School youtube.com/user/HarvardKennedySchool
Harvard Chan School of Public Health youtube.com/user/HarvardPublicHealth
Harvard Extension School youtube.com/user/HarvardExtension
Harvard Graduate School of Design youtube.com/user/TheHarvardGSD

Any faculty member belonging to the above schools has a footer on their website directing to the school’s YouTube page. For this reason, these links were not included in the below analysis.


Results

Roughly 2% of faculty members across Harvard University have videos on their websites, which translates to 248 faculty members with a total of 1806 videos across 796 unique urls.

Click the tabs below to view the different analyses.

Types of Videos

Type of Video
hyperlinked embedded
YouTube 866 416
Vimeo 384 113
Kaltura - 27

Percent of Websites with Videos

By “number of websites”, we mean “number of unique urls”. This means that many urls can be associated with just one faculty member.


Number of Websites with Videos (Bar Chart)

By “number of websites”, we mean “number of unique urls”. This means that many urls can be associated with just one faculty member.


Number of Websites with Videos (Table)

School Total Number of Sites with Videos Number with Hyperlinks Number with Embeds
BUS 33 32 1
OPR 1 1 0
DIV 3 2 2
SEAS 118 70 53
RAD 1 1 0
SPH 30 22 11
EDU 10 10 0
DES 1 0 1
HSDM 2 2 0
KSG 119 61 61
FAS 418 294 155
LAW 20 20 0
Visiting Faculty/Postdoc 13 11 2
HMS 27 5 22
Note:
Some sites may have both hyperlinks AND embedded video, so the sum of the last 2 columns may be greater than the first.

Limitations and looking forward

Under counting

  • When looking for hyperlinked videos, we only looked for links to YouTube.com and Vimeo.com. This decision was made after seeing that these were the sources of the majority of the embedded videos. Although these are the two leading video platforms not just at Harvard, but on the internet in general, it is possible that some faculty members may have links to videos hosted on other, lesser known websites.

  • It is possible that we have under-counted videos from the Harvard Medical School. HMS faculty use the Harvard Catalyst platform to create their websites. As far as we have been able to ascertain, this platform does not allow faculty members to post videos. If this functionality does in fact exist, or if HMS faculty members have websites we were not able to collect, our total count may be off.

  • Our database of faculty websites may not be comprehensive - faculty members may have websites we do not know of. As the LINK project progresses, our database of faculty websites will improve.

  • There were at least two cases where we lost legitimate data when merging the scraped data set and the biographical data set. This was because someone else (possibly an assistant) had set up an OpenScholar site for a faculty member using their own HUID, rather than the faculty member’s. While we believe we caught all instances of this (2), it is possible we missed more.

Looking forward

We are currently in the process of crawling, scraping, and analyzing department and organization websites. Once that data is processed, we will generate another report.

---
title: "Videos in Faculty Websites"
output:
  html_notebook:
    theme: cosmo
    toc: yes
    toc_depth: 3
    toc_float: yes
---
####Agasthya Shenoy and Dustin Tingley
```{r loadlibs, include=FALSE}
library(ggplot2)
library(dplyr)
library(data.table)
library(plotly)
library(knitr)
library(kableExtra)
data <- read.csv('data/in_progress.csv',stringsAsFactors = F)
nrow(data)
```

```{r, echo = FALSE}
# scrub out bad urls (duplicates)
bad_urls <- data[data$url %like% "\\?" & data$total_videos > 0,]
dont_delete <- bad_urls %>% group_by(person_id) %>% 
  summarise(n_links = n()) %>% 
  subset(n_links < 2)
bad_urls <- bad_urls[!(bad_urls$person_id %in% dont_delete$person_id),]
bad_urls <- rownames(bad_urls[bad_urls$total_videos>0,])
# drop by row name; a negative index built from an empty vector would drop every row
cleaner_data <- data[!(rownames(data) %in% bad_urls),]

# merge with bio_data to get school affiliation
bio_data <- read.csv('data/PersonLinks_Bio.csv',stringsAsFactors = F)

with_school <- merge(cleaner_data,bio_data[,c(2,9)],by.x = "person_id",by.y = "id")
with_school <- distinct(with_school)
with_school$hyperlinked_html <- gsub("\\[]","",with_school$hyperlinked_html)
with_school$embedded_html <- gsub("\\[]","",with_school$embedded_html)

remove_footers <- function(x){
  links <- strsplit(x,">,")[[1]]
  
  links <- links %>%  lapply(function(x) paste(x,">",sep=""))
  
  links <- unlist(links)
  found <- grep("HarvardEducation|HarvardLawSchool|HarvardDivinity|HarvardKennedySchool|HarvardPublicHealth|HarvardExtension|TheHarvardGSD",links)
  
  if(length(found)>0){
    links <- links[-found]
  }
  links <- paste(links,collapse=",")
  unlist(links)
}

split_html <- function(x){
  strsplit(x,">,")[[1]]
}

count_html <- function(x){
  length(split_html(x))
}

# with_school$hyperlinked_html <- lapply(with_school$hyperlinked_html,remove_footers)

real_num_html <- lapply(with_school$hyperlinked_html,count_html)
with_school$real_num_html <- unlist(real_num_html)
with_school$real_total <- with_school$num_embedded + with_school$real_num_html

write.csv(with_school,"faculty_kaltura_pre.csv")
```

#Introduction

Harvard University has over 10,000 websites associated with unique faculty members. Any number of these sites may contain embedded or hyperlinked video. To find these videos, we created a Python script to scrape the html code from each website and its subpages. We then parsed the code, searching for html tags indicative of the presence of videos. Our results and further explanation of our methods are summarised below.

---

#Methods

###Defining 'video content'

We defined video content as any video embedded within an `<iframe>` tag (used to embed videos in a webpage) or explicitly hyperlinked with an `<a>` tag. We did not include links to content creator *user pages* (e.g. 'https://www.youtube.com/harvard'), only links that directly went to a video or video playlist. Using this definition, we found that videos mostly came from YouTube.com and Vimeo.com, while a few embedded videos used the Kaltura.com platform (see [Results](#Results) for a summary).
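
The distinction between a direct video link and a user page can be sketched in Python as follows. This is a minimal illustration, not the script used for this report: the function name `is_video_link` and the specific URL patterns it checks are assumptions chosen for the example.

```python
from urllib.parse import urlparse, parse_qs


def is_video_link(url):
    """Return True if url looks like a direct video or playlist link.

    Hypothetical helper: the URL patterns below are assumptions made
    for illustration, not the rules applied in this report.
    """
    # Allow scheme-less urls such as 'youtube.com/user/HarvardEducation'
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path
    query = parse_qs(parsed.query)

    if host == "youtube.com":
        if path == "/watch":
            return "v" in query          # single video
        if path == "/playlist":
            return "list" in query       # playlist
        # embedded player urls; user pages such as /harvard or /user/... fall through
        return path.startswith("/embed/") and len(path) > len("/embed/")
    if host == "youtu.be":
        return len(path) > 1             # short link with a video id
    if host == "vimeo.com":
        return path.strip("/").isdigit() # numeric video id
    if host == "player.vimeo.com":
        return path.startswith("/video/")
    return False
```

Under these assumed patterns, 'https://www.youtube.com/harvard' is rejected as a user page, while a `/watch?v=...` or `/embed/...` url is accepted.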

```{r, echo=FALSE}

# hyperlinked
allurls_hyp <- lapply(with_school$hyperlinked_html,split_html)
allurls_hyp <- unlist(allurls_hyp)
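# fixed = TRUE matches the literal string, so "." is not treated as a wildcard
yt_hyp <- length(grep("youtube.com",allurls_hyp,fixed=TRUE))
vimeo_hyp <- length(grep("vimeo.com",allurls_hyp,fixed=TRUE))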

#embedded
allurls_emb <- lapply(with_school$embedded_html,split_html)
allurls_emb <- unlist(allurls_emb)

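# fixed = TRUE matches the literal string, so "." is not treated as a wildcard
yt_emb <- length(grep("youtube.com",allurls_emb,fixed=TRUE))
vimeo_emb <- length(grep("vimeo.com",allurls_emb,fixed=TRUE))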

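# embedded snippets from neither YouTube nor Vimeo (reported as Kaltura);
# logical indexing stays correct even when nothing matches
is_ytvim <- grepl("youtube\\.com|vimeo\\.com",allurls_emb)
nonyt <- allurls_emb[!is_ytvim]
nonyt <- length(nonyt[nonyt != ""])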

vid_sources <- data.frame(hyperlinked = c(yt_hyp,vimeo_hyp, "-"),embedded=c(yt_emb,vimeo_emb,nonyt))
rownames(vid_sources) <- c("YouTube","Vimeo","Kaltura") 
```

###Acquiring faculty websites

Websites had been previously collected for the [VPAL-R LINK](https://sites.google.com/view/facultylink) project. Help in collection was provided by other entities at the University for the purpose of LINK. 

Faculty websites were collected through 3 channels: 

(1) an initial list of sites provided by Faculty Development & Diversity

(2) webpages manually collected by VPAL-Research

(3) Harvard Open Scholar (Harvard web publishing platform)

We defined 'faculty' as anyone with the position 'Lecturer' and above - meaning that people like graduate students and teaching fellows, who may still have their own websites, were excluded from our analysis. We have not actively collected the websites for these people.

###Scraping and Cleaning the data

####Scraping
Once the websites were collected, we used a Python script, called a "crawler", to download the raw html from each website, including website subpages (e.g. 'https://scholar.harvard.edu/dtingley/' AND 'https://scholar.harvard.edu/dtingley/publications', etc.). This data was stored in an Amazon Web Services S3 bucket, which we use for the LINK project. Each page is associated with a HUID and a checksum - a code computed from the raw content of the page:

```{r, echo=FALSE}
personresource <- read.csv('data/person_resource.csv',stringsAsFactors = F)
dting <- head(personresource[grep("dtingley",personresource$url),-c(1,3,5)])

kable(dting, row.names = FALSE) %>% kable_styling(bootstrap_options = c("striped")) %>% 
  add_header_above(c("Example Data Structure (Website Collection)"=3),underline = TRUE)
```
We used the checksum to identify duplicates: because the checksum is created based on the raw content of a webpage, pages with identical checksums also have identical content - even if the urls are different. After identifying duplicate pages, we deleted them to avoid double counting videos. After the de-duplication process, we were left with **157,159** unique webpages.
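
A minimal Python sketch of checksum-based de-duplication follows. It rests on two assumptions not stated in this report: that the checksum is an MD5 digest (suggested only by the 32-character hex values in the example table), and that the first page seen for each checksum is the one kept. The function names are hypothetical.

```python
import hashlib


def checksum(raw_html):
    # Assumption: MD5, inferred from the 32-character hex digests shown above
    return hashlib.md5(raw_html.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep one page per checksum.

    pages: iterable of (person_id, url, raw_html) tuples.
    Returns a list of (person_id, url, checksum) for the first page
    seen with each distinct checksum.
    """
    seen = set()
    unique = []
    for person_id, url, raw_html in pages:
        digest = checksum(raw_html)
        if digest in seen:
            continue  # same content already recorded under another url
        seen.add(digest)
        unique.append((person_id, url, digest))
    return unique
```

Because the digest depends only on page content, two urls that serve the same html collapse to a single row, which is what prevents the same video from being counted twice.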

Next, we used a Python package called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the raw html code. We first looked for embedded videos by searching for all instances of the `<iframe>` html tag that also contained the term "youtube", "vimeo", or "player". Sometimes, `<iframe>` tags with "player" were actually embedded SoundCloud players, so "soundcloud" was an excluded term.

Once an embedded video was found, we incremented an "embedded_video" counter, and extracted an html snippet that contained the video, e.g.:

`<iframe allowfullscreen="allowfullscreen" frameborder="1" height="205"` 
`src="https://www.youtube.com/embed/mh65upNaAqM" width="400"></iframe>`

We then used the same process to find hyperlinked videos, this time looking for `<a>` tags that contained the term "youtube" or "vimeo". A separate "hyperlinked_video" counter was incremented each time we found a video on a page, and we extracted the html snippet associated with it as well:

`<a href="https://www.youtube.com/watch?v=mh65upNaAqM" target="_blank">online</a>`
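
The two searches described above can be sketched in Python roughly as follows. To keep the example dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup, so it is a stand-in for the actual script, not a copy of it. The class and function names are hypothetical, and matching terms against the text of the opening tag is an assumption about what "contained the term" means.

```python
from html.parser import HTMLParser

EMBED_TERMS = ("youtube", "vimeo", "player")
EMBED_EXCLUDED = ("soundcloud",)   # "player" iframes that are audio, not video
LINK_TERMS = ("youtube", "vimeo")


class VideoTagCounter(HTMLParser):
    """Count <iframe> and <a> tags whose opening tag mentions a video term."""

    def __init__(self):
        super().__init__()
        self.embedded_video = 0
        self.hyperlinked_video = 0
        self.embedded_html = []
        self.hyperlinked_html = []

    def handle_starttag(self, tag, attrs):
        # Raw text of the opening tag currently being handled
        raw = self.get_starttag_text() or ""
        text = raw.lower()
        if tag == "iframe":
            wanted = any(term in text for term in EMBED_TERMS)
            excluded = any(term in text for term in EMBED_EXCLUDED)
            if wanted and not excluded:
                self.embedded_video += 1
                self.embedded_html.append(raw)
        elif tag == "a":
            if any(term in text for term in LINK_TERMS):
                self.hyperlinked_video += 1
                self.hyperlinked_html.append(raw)


def count_videos(raw_html):
    """Parse one page and return the populated counter object."""
    parser = VideoTagCounter()
    parser.feed(raw_html)
    parser.close()
    return parser
```

Calling `count_videos(raw_html)` on a page yields the two counters and the matching snippets. Note that this term match alone would also count links to user pages, so those still need to be filtered separately.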

After this stage was complete, we added the embedded and hyperlinked counters to get the number of total videos, and stored the video counts and the html snippets in a .csv file:

```{r, echo=FALSE}
dtresults <- head(with_school[grep("dtingley",with_school$url),-c(4,6,8,9,10)])
kable(dtresults,row.names = FALSE) %>% kable_styling(bootstrap_options = c("striped")) %>% 
  add_header_above(c("Example Data Structure (Video Counts)"=5),underline = TRUE) %>% 
  footnote(general = "In this case, Dustin did not have any videos on his website. If videos were present, the 'embedded_html' and 'hyperlinked_html' fields would be populated.")
```

####Cleaning

Finally, we merged the above data source with an existing data source of biographical information. This data source was also created for the LINK project, and contained information about faculty members' school associations. During this merge, we lost 700 unique HUIDs from the data. Upon further investigation, it became clear that these 700 HUIDs were associated with non-faculty members (e.g. graduate students, former students, etc.). These people were present in our original data because anyone with access to HarvardKey can create an OpenScholar website - even if they are not faculty members.
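
One way to audit which HUIDs drop out of such a merge is to compare the id sets before joining. The Python sketch below is illustrative only: the key names `person_id` and `id` mirror the merge keys used in this notebook, while the function name and the row-as-dictionary layout are assumptions.

```python
def dropped_ids(scraped_rows, bio_rows):
    """Return the person_ids in the scraped data with no biographical record.

    scraped_rows: list of dicts with a 'person_id' key.
    bio_rows: list of dicts with an 'id' key.
    These are the ids an inner merge on person_id == id would discard.
    """
    bio_ids = {row["id"] for row in bio_rows}
    scraped_ids = {row["person_id"] for row in scraped_rows}
    return sorted(scraped_ids - bio_ids)
```

Reviewing the returned ids by hand is how cases such as a site registered under an assistant's HUID can be spotted before they are silently lost.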

When doing a final pass at the extracted code, we noticed that many faculty websites include a footer designed by their respective schools. These footers often included a link to the *school's* YouTube user page. These schools can be found below:

```{r, echo=FALSE}
inst_ytlinks <- c("HarvardEducation",
"HarvardLawSchool",
"HarvardDivinity",
"HarvardKennedySchool",
"HarvardPublicHealth",
"HarvardExtension",
"TheHarvardGSD")

inst_names <- c("Harvard Graduate School of Education", "Harvard Law School", "Harvard Divinity School",
                "Harvard Kennedy School", "Harvard Chan School of Public Health",
                "Harvard Extension School","Harvard Graduate School of Design")

inst_ytlinks <- inst_ytlinks %>% sapply(FUN=function(x){paste('youtube.com/user/',x,sep="")})
inst_ytlinks <- unlist(inst_ytlinks)

institutions <- data.frame(inst_ytlinks)
rownames(institutions) <- inst_names

institutions <- institutions %>% mutate(
  inst_ytlinks = text_spec(inst_ytlinks, link = inst_ytlinks),
  school = row.names(.)
)
institutions <- institutions[,c(2,1)]
names(institutions) <- c("School","YouTube Link in Footer")
```

```{r, echo = FALSE}
kable(institutions, escape=F) %>% 
  kable_styling(bootstrap_options = c("striped","hover"),full_width = F,position="center") %>% 
  add_header_above(c("Institutional YouTube Channels" = 2),underline=TRUE)
```
Any faculty member belonging to the above schools has a footer on their website directing to the school's YouTube page. For this reason, these links **were not included** in the below analysis.

---

<a name="Results"></a>

#Results {.tabset .tabset-fade .tabset-pills}  


####Roughly **2%** of faculty members across Harvard University have videos on their websites, which translates to **248** faculty members with a total of **1806** videos across **796** unique urls. 

Click the tabs below to view the different analyses. 

##Types of Videos

```{r, echo=FALSE}
kable(vid_sources) %>% 
  kable_styling(bootstrap_options = c("striped","hover"),full_width = F,position="center") %>% 
  add_header_above(c(" " = 1, "Type of Video" = 2))
```


```{r, echo=FALSE}
for_sum <- with_school %>% 
  mutate(has_vid = case_when(
    real_total > 0 ~ 1,
    TRUE ~ 0
  ),
  hyper_vid = case_when(
    num_hyperlinked > 0 ~ 1,
    TRUE ~ 0
  ),
  embed_vid = case_when(
    num_embedded > 0 ~ 1,
    TRUE ~ 0
  )) %>% group_by(school_final) %>% 
  summarise(mean_total = mean(has_vid)*100,
            n_urls = sum(has_vid),
            mean_hyper = mean(hyper_vid)*100,
            n_hyper = sum(hyper_vid),
            mean_embed = mean(embed_vid)*100,
            n_embed = sum(embed_vid))

for_sum$school_final[1] <- "Visiting Faculty/Postdoc"
for_sum <- for_sum[order(-for_sum$mean_total),]

x <- list(
  title = "School"
)
y <- list(
  title = "Percent of Websites with Videos"
)

y2 <- list(
  title = "Number of Websites with Videos"
)

```

---

##Percent of Websites with Videos

By "number of websites", we mean "number of unique urls". This means that many urls can be associated with just one faculty member.
```{r, echo=FALSE}
plot_ly(for_sum,x=~school_final,y=~mean_total,type="bar",hovertext=paste("Actual Number of Sites:", for_sum$n_urls),xaxis="School", width= 1000, height = 500) %>% 
  layout(xaxis=x,yaxis=y)
```

---

##Number of Websites with Videos (Bar Chart)

By "number of websites", we mean "number of unique urls". This means that many urls can be associated with just one faculty member.

```{r, echo=FALSE}
plot_ly(for_sum,width=1000,
         height = 500) %>% 
  add_trace(x = ~school_final,y=~n_urls,type='bar',name = 'Total Number',
            hoverinfo = "text",
            text = ~paste(school_final, ': ',n_urls,' total sites with video',sep=""),xaxis="School") %>% 
  add_trace(x = ~school_final, y=~n_hyper,type='bar',name='Number with Hyperlinks',
            hoverinfo = "text",
            text = ~paste(school_final, ': ',n_hyper,' sites with hyperlinks',sep="")) %>% 
  add_trace(x=~school_final, y=~n_embed,type='bar',name='Number with Embeds',
            hoverinfo = "text",
            text = ~paste(school_final, ': ',n_embed,' sites with embeds',sep="")) %>% 
  layout(xaxis = x,
         yaxis = y2,
         autosize = F)


# plot_ly(for_sum,x=~school_final,y=~n_urls,type="bar",hovertext=paste("Percent:", format(round(for_sum$mean, 2), nsmall = 2)),xaxis="School")%>% 
#   layout(xaxis=x,yaxis=y2)
```

---

##Number of Websites with Videos (Table)

```{r, echo=FALSE}
#number of sites table
display_for_sum <- for_sum[,c(1,3,5,7)]
names(display_for_sum) <- c("School", "Total Number of Sites with Videos", "Number with Hyperlinks", "Number with Embeds")
kable(display_for_sum) %>%
  kable_styling(bootstrap_options = c("striped","hover"),full_width = F,position="center") %>% 
  footnote(general="Some sites may have both hyperlinks AND embedded video, so the sum of the last 2 columns may be greater than the first.")
  
```

---

#Limitations and looking forward

####Under counting
- When looking for hyperlinked videos, we only looked for links to YouTube.com and Vimeo.com. This decision was made after seeing that these were the sources of the majority of the embedded videos. Although these are the two leading video platforms not just at Harvard, but on the internet in general, it is possible that some faculty members may have links to videos hosted on other, lesser known websites.

- It is possible that we have under-counted videos from the Harvard Medical School. HMS faculty use the [Harvard Catalyst](https://catalyst.harvard.edu/) platform to create their websites. As far as we have been able to ascertain, this platform does not allow faculty members to post videos. If this functionality does in fact exist, or if HMS faculty members have websites we were not able to collect, our total count may be off.

- Our database of faculty websites may not be comprehensive - faculty members may have websites we do not know of. As the LINK project progresses, our database of faculty websites will improve.

- There were at least two cases where we lost legitimate data when merging the scraped data set and the biographical data set. This was because someone else (possibly an assistant) had set up an OpenScholar site for a faculty member using their own HUID, rather than the faculty member's. While we believe we caught all instances of this (2), it is possible we missed more.

####Looking forward

We are currently in the process of crawling, scraping, and analyzing department and organization websites. Once that data is processed, we will generate another report.

```{r, include=FALSE}
write.csv(with_school,"with_school.csv")
```

